Abstract —Convolutional neural network (CNN) features, which
represent images with global, high-dimensional vectors, have
shown highly discriminative capability in image search. Although
CNN features are more compact than local descriptors, they
still cannot efficiently handle large-scale image search
because of their non-negligible computational and storage cost.
In this paper, we propose a simple but effective image indexing
framework to reduce the computational and storage cost of
CNN features. The proposed framework adapts the Bag-of-Words
model and the inverted table to global feature indexing. To this
end, two strategies, both based on the semantic information
inside CNN features, are proposed to convert a global vector into
one or several discrete words. In addition, a number of strategies
for compensating for quantization error are investigated under
the indexing framework. Extensive experimental results on three
public benchmarks show the superiority of our framework.

Index Terms —CNN, Indexing, Inverted table 

I. Introduction

Image search aims to find images relevant to a user's query in
massive data collections, which is a fundamental problem in many
real-world scenarios. Over the last two decades, most effort has
focused on improving search accuracy and efficiency. In essence,
search accuracy is closely tied to feature extraction, whereas
search efficiency depends mainly on the indexing structure.

To represent image content accurately, various feature
extraction schemes have been reported, which can be coarsely
separated into two groups: global features and local features.
Global features, such as Color Histogram, Tamura and Moment
Invariant, are often statistics of an image's color, texture or
shape information, where each image is described by a single
short vector. Generally, this kind of feature is compact and
efficient for image search. However, such features cannot handle
complex image transformations like scale and layout changes,
since they capture only low-level information. In contrast, local
features, e.g., SIFT and SURF, describe an image with a set of
local descriptors and discriminate better between different
contents. Their shortcoming is that the number of local features
extracted from even a small image dataset is huge. When the
dataset is large, an image search system based on local features
incurs prohibitive storage and computational cost.

To speed up the search process, various indexing schemes have
also been studied extensively. Most division-based indexing
structures, such as the R-tree, the M-tree and the inverted
table, divide the whole database into a series of
non-overlapping partitions. Since only a small part of the whole
dataset is compared with the query image, search efficiency
improves significantly. Another family of indexing schemes is
hashing-based methods [13]-[21], which learn a set of hash
functions to map images from their original feature space into a
distance-preserving binary space. Since fast bitwise operations
can be used to compare two binary codes, search efficiency is
boosted greatly.

While many image search frameworks have been proposed,
the current state of the art is still based on the bag-of-words
(BoW) model and the inverted table [22]-[24]. However, this
framework is only suitable for small or medium-scale image
search scenarios. When the database is large, both computational
and storage efficiency deteriorate sharply. Recently, global
feature learning schemes based on deep learning (DL) have become
popular. Instead of representing only low-level visual
information, this kind of feature encodes the high-level semantic
meaning of an image into a high-dimensional vector. For example,
the global feature learned by AlexNet [25], one of the best-known
Convolutional Neural Network (CNN) frameworks, is called the CNN
feature here and consists of thousands of variables (e.g.,
1000-D or 4096-D). Although it is extremely discriminative and
more compact than local representations, its computational
and storage cost is non-negligible when the image dataset is
huge. Therefore, it is necessary to build an efficient indexing
structure for large-scale applications.

In this paper, we develop a new image indexing framework on
the basis of CNN features, in which we adapt the BoW model and
the inverted table to high-dimensional global features. The key
contributions and novelties are summarized as follows: (1) A
simple but effective image indexing framework is proposed to
improve computational and storage efficiency. (2) Several
strategies are proposed to decompose CNN features into visual
words. Since the BoW model is tailored to local features,
strategies must be designed specially to map global features
into visual words. To this end, two dictionary construction
methods, which depend on the semantic information underlying CNN
features, are proposed. (3) Several strategies are proposed to
compensate for the information loss caused by CNN feature
decomposition. Commonly, a CNN feature is replaced with only one
visual word, which leads to large information loss. To address
this issue, multiple link and multiple assignment strategies are
applied in this paper. Extensive experimental results on three
public benchmarks show the superiority of our framework.
Fig. 1. Illustration of Convolutional Neural Network model used in this paper.


II. Related Work

In this section, we first introduce the Convolutional Neural
Network (CNN), which is used in this paper for learning the
global image features. Then we briefly review existing image
indexing structures based on the inverted table.

A. Convolutional Neural Network

Deep learning has drawn much attention from both the academic
and industrial communities because of its excellent capability
in representing image content. Hinton et al. report pioneering
work on deep learning in [29], which includes two key steps.
First, neurons are built layer by layer, training a single layer
each time. Then the “Wake-Sleep” algorithm is performed to
optimize the whole network. By adjusting the weights in the
network, the “Wake-Sleep” algorithm makes sure that the final
output represents the raw input as closely as possible. X. Wang
et al. [30] developed DeepID, whose face recognition accuracy
exceeds 99% and is reported to be even more precise than the
naked eye. Krizhevsky, Sutskever and Hinton [25] achieved a
top-5 test error rate of 15.3% in the ILSVRC-2012 competition by
using deep convolutional neural networks. In short, deep
learning has set off a wave of progress in artificial
intelligence.

In this work, we extract global image features using the deep
convolutional neural network toolkit Caffe [31]. The Caffe
network takes an image as input and generates a feature vector
of 4096 or 1000 dimensions as output. The architecture, which
originates from [25], contains eight learned layers: five
convolutional and three fully-connected layers. The whole
architecture is illustrated in Fig. 1. As shown in Fig. 1, we
use an image with three channels as the input. The convolutional
layers contain kernels of different sizes and numbers; the
kernels are 11 × 11 × 96, 5 × 5 × 256, 3 × 3 × 384, 3 × 3 × 384
and 3 × 3 × 256, in order. Overlapping pooling follows the
first, second and fifth convolutional layers. “Dropout” [32], a
technique for preventing co-adaptation of feature detectors, is
used in the sixth, seventh and eighth fully-connected layers. We
adopt the output of the sixth layer as the CNN feature, which is
a vector of 4096 dimensions.

B. Inverted Table

The inverted table was initially used to index text documents
in the text retrieval area. To facilitate image matching on the
basis of local features, Sivic et al. extended this technique to
image indexing via the BoW model [23]. The key idea of the BoW
model is to construct a visual dictionary and represent each
image with an orderless collection of visual words from the
dictionary. In this way, the representation of an image is
similar to a text document, and the inverted table can be
directly employed to index images. The basic framework of the
inverted table is shown in Fig. 2; it depends on a fixed visual
dictionary. For each visual word, a list is built, into which the
IDs of the images that contain the visual word are inserted. In
the search phase, a voting process is performed to measure the
similarity between the query and the database images.
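
As a minimal illustration of the indexing and voting steps described above (not the implementation used in the experiments), the following Python sketch builds an inverted table from images that have already been quantized into visual word IDs, and ranks database images by the number of words they share with the query. The function names and the toy word IDs are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_table(image_words):
    """Map each visual word ID to the list of image IDs that contain it."""
    table = defaultdict(list)
    for image_id, words in image_words.items():
        for word in set(words):  # one entry per (word, image) pair
            table[word].append(image_id)
    return table

def search(table, query_words):
    """Vote for every database image that shares a visual word with the query."""
    votes = Counter()
    for word in set(query_words):
        for image_id in table.get(word, []):
            votes[image_id] += 1
    # Highest vote count first; ties are broken by image ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy database: each image is an orderless collection of visual word IDs.
database = {"img_a": [3, 7, 7, 9], "img_b": [1, 3], "img_c": [2, 5]}
table = build_inverted_table(database)
ranking = search(table, [3, 9, 4])  # img_a shares two words, img_b one
```

Only the lists of the query's own words are visited, which is why the search scope is much smaller than the whole database.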

Much work has been done to improve the performance of the
inverted table. To construct the visual dictionary effectively,
fast clustering methods have been proposed to decrease the time
cost of dictionary training [34]-[36]. In addition, binary
embedding techniques have been proposed to improve search
accuracy by adding additional information to the inverted table.







Fig. 2. The basic framework of inverted table.

The traditional BoW model + inverted table framework is
tailored for local image features. However, when an image is
represented by a single global feature, the traditional inverted
table is no longer suitable for indexing images because of the
large quantization error. Therefore, we have to adopt some
strategies to adapt the traditional indexing scheme to global
features. Details are given in Section IV.


III. Overview of the Indexing Framework 
A. Problem Description 

Suppose we have a set of N images described by
D-dimensional CNN features: $X = \{x_1, \ldots, x_N\}$, where
$x_i \in \mathbb{R}^D$. Our goal is to develop an effective and
efficient image indexing framework so as to perform fast and
accurate image search. A straightforward solution is to index
the CNN features by employing hashing functions. In essence,
hashing-based schemes project images from their original feature
space into a distance-preserving binary space. However, hashing
the global features of images leads to a large loss of
discriminative capability. Therefore, we need to design a new
image indexing framework. In this paper, we design the global
feature indexing framework by borrowing ideas from the indexing
of local features. In the classical local-feature-based image
indexing framework, the key techniques are the BoW model and the
inverted table. To adapt these techniques to global features,
two main problems should be addressed.


Problem 1: How to convert a high-dimensional global feature 
into a set of visual words? 


In the traditional BoW model + inverted table framework,
the local features of an image are first quantized into visual
words, and then these visual words are indexed into the inverted
table. That is, this framework is tailored for local features.
CNN features, however, are global features, so we need to design
new strategies to map global features into visual words.


Problem 2: How to avoid large loss of discriminative 
capability when mapping global features into visual words? 

When we convert global features into visual words, a
quantization error arises between each feature and its visual
word. In particular, when a global feature is mapped to only one
visual word, the quantization error is larger. This clearly
decreases the discriminative capability of the features, since
dissimilar images may be described by the same visual word.
Therefore, additional information and strategies need to be
introduced to enhance the discriminative capability of global
features when they are indexed by the inverted table.


B. Solution Overview 

To solve Problem 1, we propose two strategies to map a
global CNN feature into one or several visual words. The
first strategy depends on the semantic information of the
different components of the global vector. Since CNN features
are relatively high-level features, each component carries some
specific semantic content. Therefore, each component can be
treated as corresponding to a virtual concept word. In this way,
a dictionary with a fixed size, equal to the length of the CNN
vector, can be constructed. In this situation, a CNN vector can
be treated as a term frequency vector against the dictionary.
For the second strategy, we directly employ traditional methods
to learn dictionaries. In our scheme, Product Quantization (PQ)
is used to learn visual dictionaries because of its low cost in
computation and memory usage.

To solve Problem 2, four compensation strategies are
proposed. First, we apply multiple link and multiple assignment
strategies to reduce the quantization error. Since one image
is represented by only one CNN feature, quantizing the CNN
feature into only one word leads to large information loss.
Therefore, we employ the multiple link strategy to index each
database CNN feature into several inverted lists in the indexing
stage. Meanwhile, we use the multiple assignment strategy to
map each query CNN feature into several visual words in the
querying stage. For the third strategy, we propose to construct
a series of light-weight inverted tables for the database so as
to alleviate missed matches. To further reduce the quantization
error, binary embedding techniques are also introduced into
the image indexing framework. Besides its visual words, each
CNN feature is also associated with a compact binary code.
In the querying stage, the images returned for a query are
further checked by calculating the Hamming distance between the
query image and each returned image. If the distance is lower
than a fixed threshold T, the returned image is retained;
otherwise it is ignored [43].

The proposed image indexing framework is related to many
existing techniques, such as the BoW model, the inverted table
and binary embedding schemes. However, the key difference is
that these related techniques are tailored to the
local-feature-based image search framework, while our scheme
adapts them to index global features. For example, the
traditional inverted table is used in the BoW model to index
local features, where each image is represented by a set of
local features. In our scheme, each image is described by only a
single CNN feature, and the original inverted table must be
adapted to index global features.

Compared with Brute Force (BF) search over CNN features,
our schemes are clearly faster, since the search scope is
greatly narrowed by the inverted table. More importantly, the
memory usage is significantly reduced thanks to the compact
description by visual words and embedding codes.

IV. Details of the Proposed Framework 

In this section, we introduce the details of the proposed
indexing framework. We first give an overview of the schemes,
in which the common components are discussed briefly. Then, the
separate techniques used in our proposed methods are given in
detail.

A. Overview of the Indexing Framework 

The final goal of our work is to use the inverted table to
perform fast and accurate image retrieval on CNN features.
To this end, visual dictionaries first need to be constructed
for quantizing the CNN features. In our indexing framework,
several dictionary construction strategies are proposed; we
discuss them later. After a dictionary with K visual words
$C = \{c_1, \ldots, c_K\}$ is constructed, the next step is to link
the images to the corresponding positions of the inverted table.
For local-feature-based image indexing [43], since each local
feature must be indexed into the inverted table, most existing
schemes link each local feature to only one visual word so as to
avoid large storage usage. We call this link rule single link
(SL). In our scenario, each image is described by one vector. If
each image is linked to only one visual word, the probability
that two similar images link to the same word will be relatively
small. Therefore, in all our schemes, each image is linked to
its S nearest words in the indexing stage, and we call this link
rule multiple link (ML).

In the search phase, the query image $x_q$ can be assigned
to the nearest visual word and linked to the corresponding
list. We call the process that assigns a query image to only
one visual word single assignment (SA). As in [43], we
can also assign a global CNN feature to W visual words and
return all images linked to these words. Following [43], we call
this process multiple assignment (MA). After that, a voting
process is performed according to the frequency with which an
image is returned. Notice that SL and SA are called hard
assignment, and ML and MA are called soft assignment, in [43],
[44].


B. CNN Feature as Term Frequency Vector 

The key to constructing an inverted table is to build a
dictionary and to convert global features into a set of words in
the dictionary. Since each component of the CNN feature vector
carries some semantic content, we can treat each dimension as
corresponding to a virtual concept word. Therefore, a dictionary
whose size equals the length of the feature vector is naturally
built, as shown in Fig. 3. Since term frequency is the
probability that each word appears in a document, the CNN
feature vector is translated into a term frequency (probability)
vector. In our scheme, given a CNN feature vector
$x = (x_1, \ldots, x_D)^T$, we further process each component
with the following softmax function:

$$\sigma(x)_j = \frac{e^{x_j}}{\sum_{d=1}^{D} e^{x_d}}, \quad j = 1, \ldots, D. \tag{1}$$

In this way, a CNN feature vector is translated into a
histogram $\sigma(x) = (\sigma(x)_1, \ldots, \sigma(x)_D)^T$, where
each bin indicates the probability that the corresponding word
appears in the image. Without any post-processing step, D items
per image would be inserted into the inverted table with D
lists, which leads to a large memory cost. To avoid this
problem, only the words corresponding to the top S bin values
are taken into account. This is reasonable, since the top-valued
bins describe the main content of the image.
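
The softmax normalization and top-S selection can be sketched in a few lines of Python. This is only an illustration: the 5-D toy vector stands in for a 4096-D CNN feature, and the function names are hypothetical.

```python
import math

def softmax(x):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def top_s_words(x, S):
    """Indices (virtual word IDs) of the S largest softmax bins."""
    probs = softmax(x)
    order = sorted(range(len(probs)), key=lambda j: (-probs[j], j))
    return order[:S]

feature = [0.5, 2.0, -1.0, 3.0, 0.0]   # toy stand-in for a CNN feature
probs = softmax(feature)               # term frequency (probability) vector
words = top_s_words(feature, 2)        # the image is linked only to these lists
```

Because softmax is monotonic, the top S bins coincide with the S largest raw components; the normalization matters only when the bin values themselves are used as probabilities.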

Fig. 3. Illustration of TIFC indexing structure.

Since too many CNN features are linked to a single position,
the quantization error is large. To further improve the
search accuracy, binary embedding codes of CNN features
are calculated and introduced into the inverted table structure.
Assume image x is linked to the virtual visual word c and
we want to generate a binary code of length L. First, we
partition x and c into L parts of equal length. The
following formulates the partitioning process:

$$x = (x_{(1)}, \ldots, x_{(L)})^T \tag{2}$$

$$c = (c_{(1)}, \ldots, c_{(L)})^T \tag{3}$$

where

$$x_{(i)} = (x_{(i-1)\times(D/L)+1}, \ldots, x_{i\times(D/L)})^T$$

$$c_{(i)} = (c_{(i-1)\times(D/L)+1}, \ldots, c_{i\times(D/L)})^T$$

Then the binary code $f = (f_1, \ldots, f_L)^T$ is calculated as follows:

$$f_i = \begin{cases} 1 & \text{if } \mathrm{mean}(x_{(i)}) > \mathrm{mean}(c_{(i)}) \\ 0 & \text{if } \mathrm{mean}(x_{(i)}) \le \mathrm{mean}(c_{(i)}) \end{cases}$$

In this scheme, since the D visual words are virtual words, we
randomly generate D-dimensional vectors to represent these
virtual words so as to calculate the binary codes. Experiments
show that this has no effect on the retrieved results. We call
this scheme “TIFC” for short. To save memory, the binary
embedding codes are stored as bits.
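
A minimal Python sketch of the binary embedding and of the Hamming-distance check used at query time is given below. It assumes that D is divisible by L and packs the L bits into a Python integer; the helper names and the toy vectors are illustrative only.

```python
def segment_means(v, L):
    """Split v into L equal-length parts and return the mean of each part."""
    if len(v) % L != 0:
        raise ValueError("vector length must be divisible by L")
    step = len(v) // L
    return [sum(v[i * step:(i + 1) * step]) / step for i in range(L)]

def binary_code(x, c, L):
    """Bit i is 1 when mean(x_(i)) > mean(c_(i)), else 0; bits are packed into an int."""
    code = 0
    for mx, mc in zip(segment_means(x, L), segment_means(c, L)):
        code = (code << 1) | (1 if mx > mc else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def verify(query_code, candidates, T):
    """Keep the candidates whose code is at Hamming distance lower than T from the query."""
    return [img for img, code in candidates if hamming(query_code, code) < T]

x = [1, 3, 0, 0, 5, 5, 2, 2]   # toy 8-D feature
c = [2, 2, 1, 1, 4, 4, 2, 2]   # toy vector standing in for the visual word
code = binary_code(x, c, L=4)  # segment means 2,0,5,2 vs 2,1,4,2 -> bits 0,0,1,0
kept = verify(code, [("img_a", 0b0011), ("img_b", 0b1101)], T=2)
```

Storing each code as packed bits means a 512-bit code occupies 64 bytes per image, and the comparison reduces to an XOR followed by a bit count.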


C. Index Large Scale of CNN Features 

Using the above-mentioned method, only a small-scale
dictionary with thousands of virtual words can be used to
construct the inverted table. However, when the image database
is very large, both the effectiveness and the efficiency of
the indexing structure deteriorate. Therefore, it is
necessary to build a large-scale visual dictionary. Taking into
account the efficiency of training and mapping, we employ the
product quantization (PQ) method [36], [43] to learn a
large-scale visual dictionary from a training set of global CNN
features. In particular, each CNN feature in the training
dataset is first partitioned into M segments of equal length,
i.e., $x = (x_{(1)}, \ldots, x_{(M)})^T$. Then k-means clustering
is performed on each vector segment to obtain a sub-codebook
with K sub-words. By using the Cartesian product, the M
sub-dictionaries (or sets) are combined to construct a final
visual dictionary with $K^M$ visual words, i.e.,
$C = C_1 \times \cdots \times C_M$. Once the large-scale
dictionary has been built, an inverted table can be constructed
against it. Similarly, binary embedding codes are associated
with the inverted table by using the binary code rule of
Section IV-B. We call this method “IFC” for short; its framework
is shown in Fig. 4.
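
To make the dictionary construction concrete, the following Python sketch trains one sub-codebook per segment with a plain Lloyd's k-means and maps a feature to a cell of the Cartesian-product dictionary. It is a simplified illustration under stated assumptions: the k-means routine, its seeding and the toy 4-D features are not from the paper, and a real system would use a far larger K and an optimized clustering library.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, K, iters=20, seed=0):
    """Plain Lloyd's k-means; returns K centers as tuples."""
    centers = random.Random(seed).sample(points, K)
    for _ in range(iters):
        groups = [[] for _ in range(K)]
        for p in points:
            nearest = min(range(K), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[k]
                   for k, g in enumerate(groups)]
    return centers

def train_pq(features, M, K):
    """Learn one K-word sub-codebook for each of the M equal-length segments."""
    step = len(features[0]) // M
    return [kmeans([tuple(f[m * step:(m + 1) * step]) for f in features], K, seed=m)
            for m in range(M)]

def quantize(feature, codebooks):
    """Return the M nearest sub-word indices, i.e. one word of C_1 x ... x C_M."""
    step = len(feature) // len(codebooks)
    return tuple(
        min(range(len(cb)), key=lambda j: sq_dist(feature[m * step:(m + 1) * step], cb[j]))
        for m, cb in enumerate(codebooks))

def cell_id(sub_words, K):
    """Flatten the M sub-word indices into a single list ID in [0, K**M)."""
    idx = 0
    for w in sub_words:
        idx = idx * K + w
    return idx

features = [(0, 0, 10, 10), (0, 1, 10, 11), (1, 0, 11, 10),
            (9, 9, 0, 0), (9, 10, 1, 0), (10, 9, 0, 1)]
codebooks = train_pq(features, M=2, K=2)
cells = [cell_id(quantize(f, codebooks), K=2) for f in features]
```

Only M × K sub-words are stored and compared, yet they address K^M inverted lists, which is what makes a large-scale dictionary affordable in both training and mapping.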
Fig. 4. Illustration of IFC indexing structure.


V. Experiments 

We present the experimental results and analysis in this
section. To validate the effectiveness of our proposed schemes,
we perform experiments on three benchmark datasets. In addition,
a large-scale dataset is added to each of the three benchmarks
as a distractor set to test the performance of the proposed
schemes in a big data environment.


A. Experimental Setup 

In our experiments, Holiday, Oxford and UKbench are employed
as the three benchmark datasets. When testing on large-scale
datasets, MIRFlickr is added to each benchmark as a
distractor set. All CNN features are extracted by using the
Caffe [31] toolkit with the default model.

Holiday is an image dataset containing personal holiday
photos. It contains 1,491 images in total, divided into
500 groups, where each group represents a specific scene
or object. The first image of each group is the query image,
and all images of the group are the correct retrieval results.

Oxford is a building dataset consisting of 5,062 images
downloaded from Flickr by searching for particular Oxford
landmarks. The dataset is manually annotated to generate a
comprehensive ground truth for 11 different landmarks, each
represented by 5 possible queries. This gives 55 query images
in total.

UKbench is a dataset of 10,200 images. Every 4 images form
a group and are photos of the same object taken from
different viewpoints. The first image of each group is collected
into the query set, which contains 2,550 images in total.

The MIRFlickr dataset contains 1,000,000 images downloaded
from Flickr. In our experiments, it is used as the distractor
set to test the performance of large-scale image search.

We use Mean Average Precision (MAP) as the accuracy
measure. Average Precision (AP) [43] is determined by the
ranks of the ground-truth images in the returned list. After
obtaining the APs of all queries, their mean value is calculated
as the final score.
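
For reference, AP and MAP can be computed as in the sketch below. It assumes the common non-interpolated definition of AP (the mean of the precision values at the ranks of the relevant images); the exact variant used in [43] may differ in detail, and the toy rankings are illustrative.

```python
def average_precision(ranked, relevant):
    """Mean of precision@rank taken at the rank of each relevant item."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    # Relevant items that are never returned contribute zero precision.
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, ground_truth) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

ap1 = average_precision(["a", "x", "b", "y"], {"a", "b"})  # (1/1 + 2/3) / 2
ap2 = average_precision(["x", "c"], {"c"})                 # 1/2
score = mean_average_precision([(["a", "x", "b", "y"], {"a", "b"}),
                                (["x", "c"], {"c"})])
```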


B. Comparison Methods


The two schemes proposed in Section IV are compared in terms
of accuracy, speed and storage cost. In addition, two baselines
are added to the comparison. One is the brute-force (BF)
scheme, which uses the CNN features to perform search directly.
The other is Locality Sensitive Hashing (LSH) [13], which is a
well-known hashing scheme.


C. Effect of Key Parameters


There are four main factors that affect the schemes: the
binary code length L, the link number S, the assignment number
W and the distance threshold T. To test the sensitivity of the
proposed schemes to these factors, we perform experiments
on the three benchmarks without the distractor set.

The effect of factor L on the three datasets is illustrated in
Fig. 5. Clearly, a medium-length binary code achieves the
best performance. This is reasonable, since a medium-length
binary code better balances missed matches and false matches.
In the following experiments, we set L = 512 for all schemes.

To test the sensitivity to factor T, we also carry out a group
of experiments on the Holiday dataset. The experimental results
are shown in Fig. 6. When the threshold is below 180, the MAP
score increases with the threshold. This is because when the
threshold is small, many correct results are wrongly rejected,
since the binary code is only an approximate representation of
the visual feature. In contrast, when the threshold is too
large, the check loses its effect, since too many wrong results
are retained. Empirically, the result is best when the threshold
is about 30% of the code length. The experiments have validated
this conclusion.

For the factors S and W in IFC, there is a strong correlation
between multiple link and multiple assignment. We observe that
the performance is best when S = W. This is because when S = W,
the probability of finding the most correct images is largest.
When S > W, the query is more likely to miss the most correct
images. In contrast, when S < W, too many wrong images are
returned. Due to the page limitation, we give the experimental
results that support this conclusion in Section V-D. Here, we
force the value of S to be equal to W. The experimental results
are illustrated in Fig. 7. Clearly, the larger S and W are, the
better the search accuracy, while the search time increases
linearly with these factors. In fact, the accuracy improvement
levels off when S and W reach a certain value. As shown in
Fig. 7, when the values of S and W are larger than 40, the MAP
tends to be stable. This means that we can significantly improve
search accuracy at a relatively small time cost.


D. Interrelationship between ML and MA 


We conduct experiments under different combinations of
S and W to evaluate the effect of multiple assignment and
multiple link. The experimental results are shown in Fig. 9.
We set the binary embedding code length L = 512 and the
threshold T = 180, the best values of these factors.

Clearly, when the value of S (or W) is fixed, the image
search accuracy improves gradually as the value of W (or S)
increases. The performance reaches its best when the value of
W (or S) reaches the value of S (or W). After that, the MAP
score tends to be stable even if W (or S) is increased further.
A possible reason is that the probability of finding correct
images is largest when W = S, as explained in Section V-C.

We summarize the observations as the two conclusions below:
Fig. 5. Illustration of MAP value variation with code length on (a) Holiday, (b) Oxford and (c) UKbench datasets.

Fig. 6. Illustration of MAP value variation with distance threshold on (a) Holiday, (b) Oxford and (c) UKbench datasets.


Conclusion 1: When the ML number S (or the MA number W) is
fixed, the MAP value increases with the growth of the MA
number W (or the ML number S).

Conclusion 2: When W = S, we get the best trade-off,
and a relatively larger value of W = S reaches higher
accuracy.

Similar conclusions can be drawn for the TIFC scheme. The
experimental results on the three benchmarks are illustrated in
Fig. 8. However, there is a small difference between the IFC
and TIFC schemes. For the TIFC scheme, the best performance
is not obtained at the W = S condition, since the link
strategy of TIFC depends on the descriptiveness of the CNN
components. In fact, when S grows to a large value (e.g., 24),
the performance tends to be stable. This means that the top 24
components of a CNN feature capture most of an image's content.
The choice of W can be larger than S. When S is fixed, search
accuracy improves as W grows, and the performance tends to be
stable when W reaches a large value (e.g., 45).


E. Comparison of Different Schemes 

In this subsection, the experimental results of the various
schemes are compared. We carry out a series of experiments
on the benchmarks with and without the distractor set.
For all proposed schemes, the factors are set to their best
values according to the conclusions in Sections V-C and V-D.
In addition, two baselines, i.e., Brute Force and LSH, are added
for comparison.

The image search accuracy on the Holiday dataset is shown in
Table I. Clearly, brute-force search outperforms all our
indexing schemes. This is not surprising, since the
quantization error unavoidably decreases the discriminative
capability of the original CNN features. Nevertheless, the
proposed indexing schemes can also achieve excellent
performance. For example, the IFC scheme obtains a MAP of
0.715185 on UKbench. Interestingly, there is no remarkable
performance deterioration after introducing a large distractor
set.

The main advantages of the proposed indexing framework lie in
its high efficiency and low storage usage. As shown in
Table II, all our schemes significantly improve search
efficiency in a large-scale environment. More importantly, the
search time grows sub-linearly with the size of the database.
For example, the searching time is only doubled
Fig. 7. Illustration of MAP value variation with multiple link and assignment numbers when they are equal on (a) Holiday, (b) Oxford and (c) UKbench datasets.

Fig. 8. Illustration of MAP value variation with different multiple link and assignment numbers for TIFC scheme on (a) Holiday, (b) Oxford and (c) UKbench datasets.


TABLE I

Evaluation of searching accuracy on three benchmarks.

Method  Holiday   Holiday+Flickr1M  Oxford    Oxford+Flickr1M  UKbench   UKbench+Flickr1M
BF      0.572769  0.489588          0.318517  0.255718         0.722123  0.657939
LSH     0.603356  0.490688          0.331508  0.261679         0.746036  0.697376
TIFC    0.513885  0.391899          0.215950  0.165186         0.627587  0.582129
IFC     0.541392  0.486320          0.262432  0.237945         0.715185  0.700910

for the IFC scheme, while the database has grown about
a thousand-fold. Notice that when a benchmark is used
without the distractor set, BF is not the slowest scheme.
This is reasonable: since the Holiday dataset contains only
1,491 images, BF needs to compare against only 1,491 CNN vectors
to get the results, whereas IFC needs to search two codebooks
with 1,000 words and to find the cell ID for each in the
inverted table. Therefore, the initial step is nontrivial for a
small dataset.

The memory usage of each scheme when performing search
on Holiday is shown in Table III. For a small dataset, since
most of the memory is used to store the visual dictionaries, the
proposed schemes take more space. When the dataset is huge,
the memory usage of all our schemes is far lower than that of
the BF scheme. For example, IFC takes only 3523.32 MB of memory,
compared to 14085.80 MB for BF.
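
As a rough sanity check of the BF storage figure for the small benchmark, the short calculation below reproduces the 23.30 MB entry of Table III for Holiday. It assumes (this is not stated in the text) that each 4096-D feature is stored as 32-bit floats and that 1 MB denotes 2^20 bytes; the function name is hypothetical.

```python
def bf_storage_mb(num_images, dim=4096, bytes_per_value=4):
    """Raw feature storage in MB (2**20 bytes), assuming float32 components."""
    return num_images * dim * bytes_per_value / 2 ** 20

holiday_mb = bf_storage_mb(1491)   # Holiday contains 1,491 images
code_bytes = 512 // 8              # one L = 512 binary code stored as bits
```

Under the same assumption, a raw feature costs 16,384 bytes per image, whereas a 512-bit embedding code costs 64 bytes, which is where most of the storage saving of the indexed schemes comes from.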

VI. Conclusion 

In this paper, we propose a simple but effective image
indexing framework to improve the computational and storage
efficiency of global CNN features. Inspired by local feature
indexing methods, we adapt the traditional BoW model +
inverted table framework to global feature indexing. By
exploring the semantic information underlying the components of
the CNN vector, we propose two visual dictionary construction
Fig. 9. Illustration of MAP value variation with different multiple link and assignment numbers for IFC scheme on (a) Holiday, (b) Oxford and (c) UKbench datasets.


TABLE II

Evaluation of Search Time (seconds) on three benchmarks.

Method  Holiday   Holiday+Flickr1M  Oxford   Oxford+Flickr1M  UKbench  UKbench+Flickr1M
BF      31.484    13379.845         29.297   1383.581         88.270   62765.156
LSH     106.113   258.002           26.488   108.477          36.755   1327.755
TIFC    30.461    147.010           1.357    17.195           27.861   1134.402
IFC     172.880   202.618           26.488   108.477          36.755   1327.755

TABLE III

Evaluation of Storage Usage (MB) on three benchmarks.

Method         Holiday  Holiday+Flickr1M  Oxford  Oxford+Flickr1M  UKbench  UKbench+Flickr1M
BF (baseline)  23.30    14085.80          23.30   14085.80         23.30    14085.80
LSH            628.87   2963.46           628.87  2963.46          628.87   2963.46
TIFC           65.23    806.80            65.23   806.80           65.23    806.80
IFC            21.43    3523.32           21.43   3523.32          21.43    3523.32

methods to map global CNN features to discrete words. To
alleviate the quantization error, four compensation strategies
are investigated in our indexing framework. Extensive
experimental results show that the proposed framework
significantly improves computational and storage efficiency with
an acceptable loss of precision.